This notebook presents Blitz Classifiers in scikit-learn.
The idea is simple: quickly fit several classifiers with mostly default settings and compare them to find the algorithm that best fits your data.
Note that the main function of Blitz Classifiers is to simplify the initial choice of algorithm. After that, you, as a Machine Learning Engineer, can choose the algorithm that best solves your problem considering complexity, scalability and knowledge.
First of all, let's import some useful libraries.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
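# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split
from sklearn import model_selection as cross_validation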
from sklearn import metrics
from sklearn.metrics import mean_squared_error
This time we'll import the following scikit-learn classifiers:
In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression
Now we'll load a structured dataset in which all columns are numeric.
In [3]:
credit = pd.read_csv('https://raw.githubusercontent.com/fclesio/learning-space/master/Datasets/02%20-%20Classification/default_credit_card.csv')
Let's look at our dataset.
In [4]:
credit.head()
Out[4]:
As we can see, we have only numerical attributes. Below, let's look at the correlation of each column with our dependent variable (DEFAULT).
In [5]:
credit.corr()["DEFAULT"]
Out[5]:
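The correlation column is easier to read when it is ordered by strength. The sketch below shows one way to do that on a small synthetic frame; the column names `feature_a` and `feature_b` and the generated values are made up for illustration and are not part of the credit dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit data (hypothetical columns, made-up values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
# The toy target depends on feature_a only, plus some noise.
toy["DEFAULT"] = (toy["feature_a"] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Correlation of every column with the target, without the target itself.
corr_with_target = toy.corr()["DEFAULT"].drop("DEFAULT")

# Order by absolute value so strong negative correlations are not pushed to the bottom.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.loc[order]
print(ranked)
```

On the real data, the same two lines can be applied to `credit.corr()["DEFAULT"]`.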
In this part of the code, we'll select the features and the target of our dataset, then split the dataset into training and test sets.
In [6]:
features = credit.columns[1:24]
target = credit.columns[24:25]
In [7]:
# X_train: independent (feature) variables for the training data set
# Y_train: dependent (outcome) variable for the training data set
# X_test: independent (feature) variables for the test data set
# Y_test: dependent (outcome) variable for the test data set
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(
    credit[features].values, credit['DEFAULT'].values, test_size=0.2, random_state=0)
Let's see the shape of our datasets.
In [8]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
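The split above is purely random. If the classes turn out to be imbalanced, an optional variation is to pass `stratify` to `train_test_split` so that both sets keep the same class proportions. The sketch below uses a made-up toy target (78% zeros, 22% ones), not the credit data, and the `_toy` names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, one feature, an imbalanced 0/1 target (made-up proportions).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 780 + [1] * 220)

# stratify=y_toy keeps the share of each class equal in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```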
Now we'll instantiate the classifiers.
In [9]:
rfc = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1, n_jobs=2)
gbc = GradientBoostingClassifier()
etc = ExtraTreesClassifier()
abc = AdaBoostClassifier()
svc = SVC()
knc = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
ptc = Perceptron()
lrc = LogisticRegression()
With the training set, we'll fit every classifier.
In [10]:
rfc.fit(X_train, Y_train)
gbc.fit(X_train, Y_train)
etc.fit(X_train, Y_train)
abc.fit(X_train, Y_train)
svc.fit(X_train, Y_train)
knc.fit(X_train, Y_train)
dtc.fit(X_train, Y_train)
ptc.fit(X_train, Y_train)
lrc.fit(X_train, Y_train)
Out[10]:
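Fitting each classifier on its own line works, but the same step can be written as a loop over a dictionary, which makes it easier to add or remove algorithms. This is a minimal sketch on synthetic data from `make_classification`, with only three of the classifiers; the names `models`, `fitted` and `test_predictions` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the credit dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One entry per algorithm; the key is the label used later in reports.
models = {
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# fit() returns the estimator itself, so the fitted models can be collected directly.
fitted = {name: model.fit(X_tr, y_tr) for name, model in models.items()}

# Predictions on the held-out set, keyed by the same labels.
test_predictions = {name: model.predict(X_te) for name, model in fitted.items()}
```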
We'll build an object called expected that holds the target values of the training set. We'll use it to check how well each model fits the training data and to inspect its errors.
In [11]:
expected = Y_train
Now we'll call the predict method on our training attributes to build the predictions of every classifier.
In [12]:
predicted_rfc = rfc.predict(X_train)
predicted_gbc = gbc.predict(X_train)
predicted_etc = etc.predict(X_train)
predicted_abc = abc.predict(X_train)
predicted_svc = svc.predict(X_train)
predicted_knc = knc.predict(X_train)
predicted_dtc = dtc.predict(X_train)
predicted_ptc = ptc.predict(X_train)
predicted_lrc = lrc.predict(X_train)
If you feel comfortable reading every classification report, feel free to execute the code below (it will be deprecated in the next version).
In [13]:
print(metrics.classification_report(expected, predicted_rfc))
print(metrics.classification_report(expected, predicted_gbc))
print(metrics.classification_report(expected, predicted_etc))
print(metrics.classification_report(expected, predicted_abc))
print(metrics.classification_report(expected, predicted_svc))
print(metrics.classification_report(expected, predicted_knc))
print(metrics.classification_report(expected, predicted_dtc))
print(metrics.classification_report(expected, predicted_ptc))
print(metrics.classification_report(expected, predicted_lrc))
The same applies to the confusion matrix of each classifier.
In [14]:
print(metrics.confusion_matrix(expected, predicted_rfc))
print(metrics.confusion_matrix(expected, predicted_gbc))
print(metrics.confusion_matrix(expected, predicted_etc))
print(metrics.confusion_matrix(expected, predicted_abc))
print(metrics.confusion_matrix(expected, predicted_svc))
print(metrics.confusion_matrix(expected, predicted_knc))
print(metrics.confusion_matrix(expected, predicted_dtc))
print(metrics.confusion_matrix(expected, predicted_ptc))
print(metrics.confusion_matrix(expected, predicted_lrc))
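To read these matrices: for a binary problem, `metrics.confusion_matrix` puts the true classes in the rows and the predicted classes in the columns, so flattening it gives true negatives, false positives, false negatives and true positives in that order. The sketch below works through a tiny hand-made example; the `_toy` arrays are invented for illustration.

```python
import numpy as np
from sklearn import metrics

# Ten hand-made labels and predictions (not from the credit data).
expected_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
predicted_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
matrix = metrics.confusion_matrix(expected_toy, predicted_toy)
tn, fp, fn, tp = matrix.ravel()

# Precision: how many predicted positives are real; recall: how many real positives were found.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(matrix)
print(precision, recall)
```

These are the same precision and recall values that the classification report shows for the positive class.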
Now we'll predict on our test dataset to see how well our models generalize.
In [15]:
predictions_rfc = rfc.predict(X_test)
predictions_gbc = gbc.predict(X_test)
predictions_etc = etc.predict(X_test)
predictions_abc = abc.predict(X_test)
predictions_svc = svc.predict(X_test)
predictions_knc = knc.predict(X_test)
predictions_dtc = dtc.predict(X_test)
predictions_ptc = ptc.predict(X_test)
predictions_lrc = lrc.predict(X_test)
Let's store the Mean Squared Error (MSE) of each classifier on the test set.
In [16]:
# scikit-learn's convention is (y_true, y_pred); the MSE value is the same either way
mse_rfc = mean_squared_error(Y_test, predictions_rfc)
mse_abc = mean_squared_error(Y_test, predictions_abc)
mse_etc = mean_squared_error(Y_test, predictions_etc)
mse_gbc = mean_squared_error(Y_test, predictions_gbc)
mse_svc = mean_squared_error(Y_test, predictions_svc)
mse_knc = mean_squared_error(Y_test, predictions_knc)
mse_dtc = mean_squared_error(Y_test, predictions_dtc)
mse_ptc = mean_squared_error(Y_test, predictions_ptc)
mse_lrc = mean_squared_error(Y_test, predictions_lrc)
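A note on what this number means for a classifier: when the labels are coded as 0 and 1, every wrong prediction contributes a squared error of 1 and every correct one contributes 0, so the MSE is simply the fraction of misclassified samples, that is, 1 minus the accuracy. The short check below assumes 0/1 labels and uses invented `_toy` arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Eight hand-made 0/1 labels with two wrong predictions (positions 2 and 5).
y_true_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1])

mse = mean_squared_error(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

# With 0/1 labels, MSE and (1 - accuracy) are the same quantity.
print(mse, 1 - acc)
```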
Now the scores:
In [17]:
print('MSE - Random Forests:', round(mse_rfc, 3))
print('MSE - Gradient Boosting:', round(mse_gbc, 3))
print('MSE - Extra Trees:', round(mse_etc, 3))
print('MSE - Ada Boosting:', round(mse_abc, 3))
print('MSE - SVM:', round(mse_svc, 3))
print('MSE - KNN:', round(mse_knc, 3))
print('MSE - Decision Trees:', round(mse_dtc, 3))
print('MSE - Perceptron:', round(mse_ptc, 3))
print('MSE - Logistic Regression:', round(mse_lrc, 3))
OK, let's rank our algorithms to find the best one to start our analysis with.
In [18]:
algorithms = {
    'Algorithm': ['Random Forests', 'Gradient Boosting', 'Extra Trees', 'Ada Boosting', 'SVM',
                  'KNN', 'Decision Trees', 'Perceptron', 'Logistic Regression'],
    'MSE': [round(mse_rfc, 4), round(mse_gbc, 4), round(mse_etc, 4), round(mse_abc, 4), round(mse_svc, 4),
            round(mse_knc, 4), round(mse_dtc, 4), round(mse_ptc, 4), round(mse_lrc, 4)]}
# Convert to a pandas DataFrame so the results can be sorted (lowest MSE first)
algos = pd.DataFrame(algorithms)
algos.sort_values(by='MSE', ascending=True)
Out[18]:
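Since the notebook predicts on both the training and the test set, it can help to put the two errors side by side in the ranking: a model with a very low training error but a clearly higher test error is overfitting. The sketch below builds such a table on synthetic data with three classifiers; the column names `Train MSE` and `Test MSE` and the variable `ranking` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the credit dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Random Forests": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "Algorithm": name,
        # Error on the data the model has already seen.
        "Train MSE": round(mean_squared_error(y_tr, model.predict(X_tr)), 4),
        # Error on held-out data, used for the ranking.
        "Test MSE": round(mean_squared_error(y_te, model.predict(X_te)), 4),
    })

ranking = pd.DataFrame(rows).sort_values(by="Test MSE", ascending=True).reset_index(drop=True)
print(ranking)
```

A fully grown decision tree will typically show a training error close to zero here, which is a reminder that only the test column says anything about generalization.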
As we can see, the Gradient Boosting algorithm shows the best performance (the lowest MSE) with default hyperparameters on this dataset. We can start our analysis and development based on this algorithm.
There's a lot of work to do, but this is the beginning. Thanks for reading.